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Abstract 

Convolutional neural networks (CNNs) have recently 
been very successful in a variety of computer vision tasks, 
especially on those linked to recognition. Optical flow esti¬ 
mation has not been among the tasks where CNNs were suc¬ 
cessful. In this paper we construct appropriate CNNs which 
are capable of solving the optical flow estimation problem 
as a supervised learning task. We propose and compare 
two architectures: a generic architecture and another one 
including a layer that correlates feature vectors at different 
image locations. 

Since existing ground truth datasets are not sufficiently 
large to train a CNN, we generate a synthetic Flying Chairs 
dataset. We show that networks trained on this unrealistic 
data still generalize very well to existing datasets such as 
Sintel and KITTI, achieving competitive accuracy at frame 
rates of 5 to 10 fps. 



Figure 1. We present neural networks which learn to estimate op¬ 
tical flow, being trained end-to-end. The information is first spa¬ 
tially compressed in a contractive part of the network and then 
refined in an expanding part. 


1. Introduction 

Convolutional neural networks have become the method 
of choice in many fields of computer vision. They are clas¬ 
sically applied to classification [25, 24], but recently pre¬ 
sented architectures also allow for per-pixel predictions like 
semantic segmentation [28] or depth estimation from single 
images [10]. In this paper, we propose training CNNs end- 
to-end to learn predicting the optical flow field from a pair 
of images. 

While optical flow estimation needs precise per-pixel lo¬ 
calization, it also requires finding correspondences between 
two input images. This involves not only learning image 
feature representations, but also learning to match them at 
different locations in the two images. In this respect, optical 


* Supported by the Deutsche Telekom Stiftung 
^ These authors contributed equally 


flow estimation fundamentally differs from previous appli¬ 
cations of CNNs. 

Since it was not clear whether this task could be solved 
with a standard CNN architecture, we additionally devel¬ 
oped an architecture with a correlation layer that explicitly 
provides matching capabilities. This architecture is trained 
end-to-end. The idea is to exploit the ability of convolu¬ 
tional networks to learn strong features at multiple levels of 
scale and abstraction and to help it with finding the actual 
correspondences based on these features. The layers on top 
of the correlation layer learn how to predict flow from these 
matches. Surprisingly, helping the network this way is not 
necessary and even the raw network can learn to predict op¬ 
tical flow with competitive accuracy. 

Training such a network to predict generic optical flow 
requires a sufficiently large training set. Although data aug¬ 
mentation does help, the existing optical flow datasets are 
still too small to train a network on par with state of the art. 


1 







Getting optical flow ground truth for realistic video material 
is known to be extremely difficult [7]. Trading in realism 
for quantity, we generate a synthetic Flying Chairs dataset 
which consists of random background images from Flickr 
on which we overlay segmented images of chairs from [1]. 
These data have little in common with the real world, but 
we can generate arbitrary amounts of samples with custom 
properties. CNNs trained on just these data generalize sur¬ 
prisingly well to realistic datasets, even without fine-tuning. 

Leveraging an efficient GPU implementation of CNNs, 
our method is faster than most competitors. Our networks 
predict optical flow at up to 10 image pairs per second on 
the full resolution of the Sintel dataset, achieving state-of- 
the-art accuracy among real-time methods. 

2. Related Work 

Optical Flow. Variational approaches have dominated 
optical flow estimation since the work of Horn and 
Schunck [19]. Many improvements have been introduced 
[29, 5, 34]. The recent focus was on large displacements, 
and combinatorial matching has been integrated into the 
variational approach [6, 35]. The work of [35] termed Deep- 
Matching and DeepFlow is related to our work in that fea¬ 
ture information is aggregated from fine to coarse using 
sparse convolutions and max-pooling. However, it does 
not perform any learning and all parameters are set man¬ 
ually. The successive work of [30] termed EpicFlow has 
put even more emphasis on the quality of sparse matching 
as the matches from [35] are merely interpolated to dense 
flow fields while respecting image boundaries. We only use 
a variational approach for optional refinement of the fiow 
field predicted by the convolutional net and do not require 
any handcrafted methods for aggregation, matching and in¬ 
terpolation. 

Several authors have applied machine learning tech¬ 
niques to optical fiow before. Sun et al. [32] study statis¬ 
tics of optical fiow and learn regularizers using Gaussian 
scale mixtures; Rosenbaum et al. [31] model local statis¬ 
tics of optical fiow with Gaussian mixture models. Black et 
al. [4] compute principal components of a training set of 
fiow fields. To predict optical fiow they then estimate coef¬ 
ficients of a linear combination of these ’basis fiows’. Other 
methods train classifiers to select among different inertial 
estimates [21] or to obtain occlusion probabilities [27]. 

There has been work on unsupervised learning of dis¬ 
parity or motion between frames of videos using neural 
network models. These methods typically use multiplica¬ 
tive interactions to model relations between a pair of im¬ 
ages. Disparities and optical fiow can then be inferred from 
the latent variables. Taylor et al. [33] approach the task 
with factored gated restricted Boltzmann machines. Konda 
and Memisevic [23] use a special autoencoder called ‘syn¬ 
chrony autoencoder’. While these approaches work well 


in a controlled setup and learn features useful for activity 
recognition in videos, they are not competitive with classi¬ 
cal methods on realistic videos. 

3. Network Architectures 

Convolutional Networks. Convolutional neural net¬ 
works trained with backpropagation [25] have recently been 
shown to perform well on large-scale image classification 
by Krizhevsky et al. [24]. This gave the beginning to a 
surge of works on applying CNNs to various computer vi¬ 
sion tasks. 

While there has been no work on estimating optical fiow 
with CNNs, there has been research on matching with neu¬ 
ral networks. Fischer et al. [12] extract feature represen¬ 
tations from CNNs trained in supervised or unsupervised 
manner and match these features based on Euclidean dis¬ 
tance. Zbontar and LeCun [36] train a CNN with a Siamese 
architecture to predict similarity of image patches. A dras¬ 
tic difference of these methods to our approach is that they 
are patch based and leave the spatial aggregation to post¬ 
processing, whereas the networks in this paper directly pre¬ 
dict complete fiow fields. 

Recent applications of CNNs include semantic segmen¬ 
tation [11, 15, 17, 28], depth prediction [10], keypoint pre¬ 
diction [17] and edge detection [13]. These tasks are simi¬ 
lar to optical fiow estimation in that they involve per-pixel 
predictions. Since our architectures are largely inspired by 
the recent progress in these per-pixel prediction tasks, we 
briefiy review different approaches. 

The simplest solution is to apply a conventional CNN in 
a ‘sliding window’ fashion, hence computing a single pre¬ 
diction (e.g. class label) for each input image patch [8, 11]. 
This works well in many situations, but has drawbacks: 
high computational costs (even with optimized implementa¬ 
tions involving re-usage of intermediate feature maps) and 
per-patch nature, disallowing to account for global output 
properties, for example sharp edges. Another simple ap¬ 
proach [17] is to upsample all feature maps to the desired 
full resolution and stack them together, resulting in a con¬ 
catenated per-pixel feature vector that can be used to predict 
the value of interest. 

Eigen et al. [10] refine a coarse depth map by training an 
additional network which gets as inputs the coarse predic¬ 
tion and the input image. Long et al. [28] and Dosovitskiy et 
al. [9] iteratively refine the coarse feature maps with the 
use of ‘upconvolutionaT layers ^ . Our approach integrates 
ideas from both works. Unlike Long et al., we ‘upconvolve’ 
not just the coarse prediction, but the whole coarse feature 
maps, allowing to transfer more high-level information to 
the fine prediction. Unlike Dosovitskiy et al., we concate- 

^ These layers are often named ’deconvolutionaT, although the opera¬ 
tion they perform is technically convolution, not deconvolution 
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Figure 2. The two network architectures: FlowNetSimple (top) and FlowNetCorr (bottom). 


nate the ‘upconvolution’ results with the features from the 
‘contractive’ part of the network. 

Convolutional neural networks are known to be very 
good at learning input-output relations given enough la¬ 
beled data. We therefore take an end-to-end learning ap¬ 
proach to predicting optical flow: given a dataset consisting 
of image pairs and ground truth flows, we train a network 
to predict the x-y flow flelds directly from the images. But 
what is a good architecture for this purpose? 

A simple choice is to stack both input images together 
and feed them through a rather generic network, allowing 
the network to decide itself how to process the image pair to 
extract the motion information. This is illustrated in Fig. 2 
(top). We call this architecture consisting only of convolu¬ 
tional layers ‘FlowNetSimple’. 

In principle, if this network is large enough, it could learn 
to predict optical flow. However, we can never be sure that a 
local gradient optimization like stochastic gradient descent 
can get the network to this point. Therefore, it could be ben- 
eflcial to hand-design an architecture which is less generic, 
but may perform better with the given data and optimization 
techniques. 

A straightforward step is to create two separate, yet iden¬ 
tical processing streams for the two images and to combine 
them at a later stage as shown in Fig. 2 (bottom). With 
this architecture the network is constrained to first produce 
meaningful representations of the two images separately 
and then combine them on a higher level. This roughly 


resembles the standard matching approach when one first 
extracts features from patches of both images and then com¬ 
pares those feature vectors. However, given feature repre¬ 
sentations of two images, how would the network And cor¬ 
respondences? 

To aid the network in this matching process, we intro¬ 
duce a ‘correlation layer’ that performs multiplicative patch 
comparisons between two feature maps. An illustration 
of the network architecture ‘FlowNetCorr’ containing this 
layer is shown in Fig. 2 (bottom). Given two multi-channel 
feature maps fi, f 2 : ^ with w, h, and c being their 

width, height and number of channels, our correlation layer 
lets the network compare each patch from fi with each path 
from f 2 . 

For now we consider only a single comparison of two 
patches. The ’correlation’ of two patches centered at xi in 
the first map and X 2 in the second map is then defined as 

c(xi,X2)= y] (fi(xi + o),f2(x2 + o)) (1) 

oG [—k,k] X [—k,k] 

for a square patch of size K := 2k 1. Note that Eq. 1 
is identical to one step of a convolution in neural networks, 
but instead of convolving data with a Alter, it convolves data 
with other data. For this reason, it has no trainable weights. 

Computing c(xi,X 2 ) involves c • K‘^ multiplications. 
Comparing all patch combinations involves w‘^ • h‘^ such 
computations, yields a large result and makes efficient for¬ 
ward and backward passes intractable. Thus, for computa- 
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Figure 3. Refinement of the coarse feature maps to the high reso¬ 
lution prediction. 

tional reasons we limit the maximum displacement for com¬ 
parisons and also introduce striding in both feature maps. 

Given a maximum displacement d, for each location xi 
we compute correlations c(xi, X 2 ) only in a neighborhood 
of size D := 2d + 1, by limiting the range of X 2 . We use 
strides si and S 2 , to quantize xi globally and to quantize X 2 
within the neighborhood centered around xi. 

In theory, the result produced by the correlation is four¬ 
dimensional: for every combination of two 2D positions we 
obtain a correlation value, i.e. the scalar product of the two 
vectors which contain the values of the cropped patches re¬ 
spectively. In practice we organize the relative displace¬ 
ments in channels. This means we obtain an output of size 
{w X h X D‘^). For the backward pass we implemented the 
derivatives with respect to each bottom blob accordingly. 

Refinement. CNNs are good at extracting high-level ab¬ 
stract features of images, by interleaving convolutional lay¬ 
ers and pooling, i.e. spatially shrinking the feature maps. 
Pooling is necessary to make network training computation¬ 
ally feasible and, more fundamentally, to allow aggregation 
of information over large areas of the input images. How¬ 
ever, pooling results in reduced resolution, so in order to 
provide dense per-pixel predictions we need a way to refine 
the coarse pooled representation. 

Our approach to this refinement is depicted in Figure 3. 
The main ingredient are ‘upconvolutionaT layers, consist¬ 
ing of unpooling (extending the feature maps, as opposed to 
pooling) and a convolution. Such layers have been used pre¬ 
viously [38, 37, 16, 28, 9]. To perform the refinement, we 
apply the ‘upconvolution’ to feature maps, and concatenate 
it with corresponding feature maps from the ’contractive’ 
part of the network and an upsampled coarser flow predic¬ 
tion (if available). This way we preserve both the high-level 
information passed from coarser feature maps and fine lo¬ 
cal information provided in lower layer feature maps. Each 
step increases the resolution twice. We repeat this 4 times, 
resulting in a predicted flow for which the resolution is still 
4 times smaller than the input. 

We discover that further reflnement from this resolution 
does not signiflcantly improve the results, compared to a 


Figure 4. The effect of variational refinement. In case of small 
motions (first row) the predicted flow is changed dramatically. For 
larger motions (second row), big errors are not corrected, but the 
flow field is smoothed, resulting in lower EPE. 


computationally less expensive bilinear upsampling to full 
image resolution. The result of this bilinear upsampling is 
the final flow predicted by the network. 

In an alternative scheme, instead of bilinear upsampling 
we use the variational approach from [6] without the match¬ 
ing term: we start at the 4 times downsampled resolution 
and then use the coarse to fine scheme with 20 iterations to 
bring the flow field to the full resolution. Finally, we run 5 
more iterations at the full image resolution. We additionally 
compute image boundaries with the approach from [26] and 
respect the detected boundaries by replacing the smooth¬ 
ness coefficient by a = ex.-p{—Xb{x,y)^), where b{x,y) 
denotes the thin boundary strength resampled at the respec¬ 
tive scale and between pixels. This upscaling method is 
more computationally expensive than simple bilinear up- 
sampling, but adds the benefits of variational methods to 
obtain smooth and subpixel-accurate flow fields. In the fol¬ 
lowing, we denote the results obtained by this variational 
refinement with a ‘-Hv’ suffix. An example of variational 
refinement can be seen in Fig. 4. 

4 . Training Data 

Unlike traditional approaches, neural networks require 
data with ground truth not only for optimizing several pa¬ 
rameters, but to learn to perform the task from scratch. In 
general, obtaining such ground truth is hard, because true 
pixel correspondences for real world scenes cannot easily be 
determined. An overview of the available datasets is given 
in Table 1. 



Frame 

pairs 

Frames with 
ground truth 

Ground truth 
density per frame 

Middlebury 

72 

8 

100% 

KITTI 

194 

194 

-50% 

Sintel 

1,041 

1,041 

100% 

Flying Chairs 

22,872 

22,872 

100% 


Table 1. Size of already available datasets and the proposed Flying 
Chairs dataset. 




4.1. Existing Datasets 

The Middlebury dataset [2] contains only 8 image pairs 
for training, with ground truth flows generated using four 
different techniques. Displacements are very small, typi¬ 
cally below 10 pixels. 

The KITTI dataset [14] is larger (194 training image 
pairs) and includes large displacements, but contains only a 
very special motion type. The ground truth is obtained from 
real world scenes by simultaneously recording the scenes 
with a camera and a 3D laser scanner. This assumes that the 
scene is rigid and that the motion stems from a moving ob¬ 
server. Moreover, motion of distant objects, such as the sky, 
cannot be captured, resulting in sparse optical flow ground 
truth. 

The MPI Sintel [7] dataset obtains ground truth from ren¬ 
dered artiflcial scenes with special attention to realistic im¬ 
age properties. Two versions are provided: the Final ver¬ 
sion contains motion blur and atmospheric effects, such as 
fog, while the Clean version does not include these effects. 
Sintel is the largest dataset available (1,041 training image 
pairs for each version) and provides dense ground truth for 
small and large displacement magnitudes. 

4.2. Flying Chairs 

The Sintel dataset is still too small to train large CNNs. 
To provide enough training data, we create a simple syn¬ 
thetic dataset, which we name Flying Chairs, by applying 
affine transformations to images collected from Flickr and 
a publicly available rendered set of 3D chair models [1]. 
We retrieve 964 images from Flickr^ with a resolution of 
1,024 X 768 from the categories ‘city’ (321), ‘landscape’ 
(129) and ‘mountain’ (514). We cut the images into 4 quad¬ 
rants and use the resulting 512 x 384 image crops as back¬ 
ground. As foreground objects we add images of multi¬ 
ple chairs from [1] to the background. From the original 
dataset we remove very similar chairs, resulting in 809 chair 
types and 62 views per chair available. Examples are shown 
in Figure 5. 

To generate motion, we randomly sample affine transfor¬ 
mation parameters for the background and the chairs. The 
chairs’ transformations are relative to the background trans¬ 
formation, which can be interpreted as both the camera and 
the objects moving. Using the transformation parameters 
we render the second image, the optical flow and occlusion 
regions. 

All parameters for each image pair (number, types, sizes 
and initial positions of the chairs; transformation parame¬ 
ters) are randomly sampled. We adjust the random distri¬ 
butions of these parameters in such a way that the result¬ 
ing displacement histogram is similar to the one from Sintel 

^Non-commercial public license. We use the code framework by Hays 
and Efros [18] 


(details can be found in the supplementary material). Us¬ 
ing this procedure, we generate a dataset with 22,872 im¬ 
age pairs and flow flelds (we re-use each background image 
multiple times). Note that this size is chosen arbitrarily and 
could be larger in principle. 

4.3. Data Augmentation 

A widely used strategy to improve generalization of neu¬ 
ral networks is data augmentation [24, 10]. Even though 
the Flying Chairs dataset is fairly large, we And that us¬ 
ing augmentations is crucial to avoid overfltting. We per¬ 
form augmentation online during network training. The 
augmentations we use include geometric transformations: 
translation, rotation and scaling, as well as additive Gaus¬ 
sian noise and changes in brightness, contrast, gamma, and 
color. To be reasonably quick, all these operations are pro¬ 
cessed on the GPU. Some examples of augmentation are 
given in Fig. 5. 

As we want to increase not only the variety of images 
but also the variety of flow flelds, we apply the same strong 
geometric transformation to both images of a pair, but ad¬ 
ditionally a smaller relative transformation between the two 
images. We adapt the flow fleld accordingly by applying the 
per-image augmentations to the flow fleld from either side. 

Speciflcally we sample translation from a the range 
[—20%, 20%] of the image width for x and y; rotation from 
[—17°, 17°]; scaling from [0.9, 2.0]. The Gaussian noise 
has a sigma uniformly sampled from [0, 0.04]; contrast is 
sampled within [—0.8, 0.4]; multiplicative color changes to 
the RGB channels per image from [0.5, 2]; gamma values 
from [0.7, 1.5] and additive brightness changes using Gaus¬ 
sian with a sigma of 0.2. 

5. Experiments 

We report the results of our networks on the Sintel, 
KITTI and Middlebury datasets, as well as on our synthetic 
Flying Chairs dataset. We also experiment with flne-tuning 
of the networks on Sintel data and variational reflnement of 
the predicted flow flelds. Additionally, we report runtimes 
of our networks, in comparison to other methods. 

5.1. Network and Training Details 

The exact architectures of the networks we train are 
shown in Fig. 2. Overall, we try to keep the architectures of 
different networks consistent: they have nine convolutional 
layers with stride of 2 (the simplest form of pooling) in six 
of them and a ReLU nonlinearity after each layer. We do not 
have any fully connected layers, which allows the networks 
to take images of arbitrary size as input. Convolutional Al¬ 
ter sizes decrease towards deeper layers of networks: 7x7 
for the first layer, 5 x 5 for the following two layers and 
3x3 starting from the fourth layer. The number of feature 
maps increases in the deeper layers, roughly doubling after 




Figure 5. Two examples from the Flying Chairs dataset. Generated image pair and color coded flow fleld (first three columns), augmented 
image pair and corresponding color coded flow fleld respectively (last three columns). 


each layer with a stride of 2. For the correlation layer in 
FlowNetC we chose the parameters k = 0, d = 20, si = 1, 
52 = 2. As training loss we use the endpoint error (EPE), 
which is the standard error measure for optical flow estima¬ 
tion. It is the Euclidean distance between the predicted flow 
vector and the ground truth, averaged over all pixels. 

Eor training CNNs we use a modified version of the 
caffe [20] framework. We choose Adam [22] as optimiza¬ 
tion method because for our task it shows faster conver¬ 
gence than standard stochastic gradient descent with mo¬ 
mentum. We fix the parameters of Adam as recommended 
in [22]: f3i = 0.9 and P 2 = 0.999. Since, in a sense, every 
pixel is a training sample, we use fairly small mini-batches 
of 8 image pairs. We start with learning rate A = le—4 
and then divide it by 2 every 100k iterations after the first 
300k. With ElowNetCorr we observe exploding gradients 
with A = le—4. To tackle this problem, we start by training 
with a very low learning rate A = le—6, slowly increase it 
to reach A = le—4 after 10k iterations and then follow the 
schedule just described. 

To monitor overfitting during training and fine-tuning, 
we split the Elying Chairs dataset into 22, 232 training and 
640 test samples and split the Sintel training set into 908 
training and 133 validation pairs. 

We found that upscaling the input images during testing 
may improve the performance. Although the optimal scale 
depends on the specific dataset, we fixed the scale once for 
each network for all tasks. Eor ElowNetS we do not upscale, 
for ElowNetC we chose a factor of 1.25. 

Fine-tuning. The used datasets are very different in terms 
of object types and motions they include. A standard so¬ 
lution is to fine-tune the networks on the target datasets. 
The KITTI dataset is small and only has sparse flow ground 
truth. Therefore, we choose to fine-tune on the Sintel train¬ 
ing set. We use images from the Clean and Einal versions 
of Sintel together and fine-tune using a low learning rate 
A = le—6 for several thousand iterations. Eor best perfor¬ 
mance, after defining the optimal number of iterations using 


a validation set, we then fine-tune on the whole training set 
for the same number of iterations. In tables we denote fine- 
tuned networks with a ‘+ft’ suffix. 

5.2. Results 

Table 2 shows the endpoint error (EPE) of our networks 
and several well-performing methods on public datasets 
(Sintel, KITTI, Middlebury), as well as on our Elying 
Chairs dataset. Additionally we show runtimes of different 
methods on Sintel. 

The networks trained just on the non-realistic Elying 
Chairs perform very well on real optical flow datasets, beat¬ 
ing for example the well-known EDGE [6] method. Af¬ 
ter fine-tuning on Sintel our networks can outperform the 
competing real-time method EPPM [3] on Sintel Pinal and 
KITTI while being twice as fast. 

Sintel. Prom Table 2 one can see that ElowNetC is better 
than ElowNetS on Sintel Clean, while on Sintel Pinal the 
situation changes. On this difficult dataset, PlowNetS+ft+v 
is even on par with DeepPlow. Since the average end¬ 
point error often favors over-smoothed solutions, it is in¬ 
teresting to see qualitative results of our method. Pigure 7 
shows examples of the raw optical flow predicted by the two 
ElowNetS (without fine-tuning), compared to ground truth 
and EpicPlow. The figure shows how the nets often pro¬ 
duce visually appealing results, but are still worse in terms 
of endpoint error. Taking a closer look reveals that one rea¬ 
son for this may be the noisy non-smooth output of the nets 
especially in large smooth background regions. This we can 
partially compensate with variational refinement. 

KITTI. The KITTI dataset contains strong projective 
transformations which are very different from what the net¬ 
works encountered during training on Plying Chairs. Still, 
the raw network output is already fairly good, and additional 
fine-tuning and variational refinement give a further boost. 
Interestingly, fine-tuning on Sintel improves the results on 











Method 

Sintel Clean 

Sintel Final 

KITTI 

Middlebury train 

Middlebury test 

Chairs 

Time (sec) 


train 

test 

train 

test 

train 

test 

AEE 

AAE 

AEE 

AAE 

test 

CPU 

GPU 

EpicFlow [30] 

2.40 

4.12 

3.70 

6.29 

3.47 

3.8 

0.31 

3.24 

0.39 

3.55 

2.94 

16 

- 

DeepFlow [35] 

3.31 

5.38 

4.56 

7.21 

4.58 

5.8 

0.21 

3.04 

0.42 

4.22 

3.53 

17 

- 

EPPM [3] 

- 

6.49 

- 

8.38 

- 

9.2 

- 

- 

0.33 

3.36 

- 

- 

0.2 

LDOF [6] 

4.29 

7.56 

6.42 

9.12 

13.73 

12.4 

0.45 

4.97 

0.56 

4.55 

3.47 

65 

2.5 

FlowNetS 

4.50 

7.42 

5.45 

8.43 

8.26 

- 

1.09 

13.28 

- 

- 

2.71 

- 

0.08 

FlowNetS-i-v 

3.66 

6.45 

4.76 

7.67 

6.50 

- 

0.33 

3.87 

- 

- 

2.86 

- 

1.05 

FlowNetS-i-ft 

(3.66) 

6.96 

(4.44) 

7.76 

7.52 

9.1 

0.98 

15.20 

- 

- 

3.04 

- 

0.08 

FlowNetS-i-ft-i-v 

(2.97) 

6.16 

(4.07) 

7.22 

6.07 

7.6 

0.32 

3.84 

0.47 

4.58 

3.03 

- 

1.05 

FlowNetC 

4.31 

7.28 

5.87 

8.81 

9.35 

- 

1.15 

15.64 

- 

- 

2.19 

- 

0.15 

FlowNetC-i-v 

3.57 

6.27 

5.25 

8.01 

7.45 

- 

0.34 

3.92 

- 

- 

2.61 

- 

1.12 

FlowNetC-i-ft 

(3.78) 

6.85 

(5.28) 

8.51 

8.79 

- 

0.93 

12.33 

- 

- 

2.27 

- 

0.15 

FlowNetC-i-ft-i-v 

(3.20) 

6.08 

(4.83) 

7.88 

7.31 

- 

0.33 

3.81 

0.50 

4.52 

2.67 

- 

1.12 


Table 2. Average endpoint errors (in pixels) of our networks compared to several well-performing methods on different datasets. The 
numbers in parentheses are the results of the networks on data they were trained on, and hence are not directly comparable to other results. 
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Figure 6. Examples of optical flow prediction on the Flying Chairs 
dataset. The images include flne details and small objects with 
large displacements which EpicFlow often fails to And. The net¬ 
works are much more successful. 


KITTI, probably because the images and motions in Sin- 
tel are more natural than in Flying Chairs. The FlowNetS 
outperforms FlowNetC on this dataset. 

Flying Chairs. Our networks are trained on the Flying 
Chairs, and hence are expected to perform best on those. 
When training, we leave aside a test set consisting of 640 
images. Table 2 shows the results of various methods on this 
test set, some example predictions are shown in Fig. 6. One 
can see that FlowNetC outperforms FlowNetS and that the 
nets outperform all state-of-the-art methods. Another inter¬ 
esting finding is that this is the only dataset where the varia¬ 
tional refinement does not improve performance but makes 
things worse. Apparently the networks can do better than 
variational refinement already. This indicates that with a 
more realistic training set, the networks might also perform 
even better on other data. 

Timings. In Table 2 we show the per-frame runtimes of 
different methods in seconds. Unfortunately, many meth¬ 
ods only provide the runtime on a single CPU, whereas our 
FlowNet uses layers only implemented on GPU. While the 
error rates of the networks are below the state of the art, 
they are the best among real-time methods. For both train¬ 


ing and testing of the networks we use an NVIDIA GTX Ti¬ 
tan GPU. The CPU timings of DeepFlow and EpicFlow are 
taken from [30], while the timing of LDOF was computed 
on a single 2.66GHz core. 

5.3. Analysis 

Training data. To check if we benefit from using the 
Flying Chairs dataset instead of Sintel, we trained a net¬ 
work just on Sintel, leaving aside a validation set to control 
the performance. Thanks to aggressive data augmentation, 
even Sintel alone is enough to learn optical fiow fairly well. 
When testing on Sintel, the network trained exclusively on 
Sintel has EPE roughly 1 pixel higher than the net trained 
on Flying Chairs and fine-tuned on Sintel. 

The Flying Chairs dataset is fairly large, so is data aug¬ 
mentation still necessary? The answer is positive: training 
a network without data augmentation on the Flying Chairs 
results in an EPE increase of roughly 2 pixels when testing 
on Sintel. 

Comparing the architectures. The results in Table 2 al¬ 
low to draw conclusions about strengths and weaknesses of 
the two architectures we tested. 

First, FlowNetS generalizes to Sintel Final better than 
FlowNetC. On the other hand, FlowNetC outperforms 
FlowNetS on Flying chairs and Sintel Clean. Note that Fly¬ 
ing Chairs do not include motion blur or fog, as in Sintel 
Final. These results together suggest that even though the 
number of parameters of the two networks is virtually the 
same, the FlowNetC slightly more overfits to the training 
data. This does not mean the network remembers the train¬ 
ing samples by heart, but it adapts to the kind of data it is 
presented during training. Though in our current setup this 
can be seen as a weakness, if better training data were avail¬ 
able it could become an advantage. 

Second, FlowNetC seems to have more problems with 
large displacements. This can be seen from the results 
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Figure 7. Examples of optical flow prediction on the Sintel dataset. In each row left to right: overlaid image pair, ground truth flow and 3 
predictions: EpicFlow, FlowNetS and FlowNetC. Endpoint error is shown for every frame. Note that even though the EPE of FlowNets is 
usually worse than that of EpicFlow, the networks often better preserve flne details. 


on KITTI discussed above, and also from detailed perfor¬ 
mance analysis on Sintel Final (not shown in the tables). 
FlowNetS+ft achieves an s40+ error (EPE on pixels with 
displacements of at least 40 pixels) of 43.3px, and for 
FlowNetC+ft this value is 48px. One explanation is that the 
maximum displacement of the correlation does not allow to 
predict very large motions. This range can be increased at 


the cost of computational efficiency. 

6. Conclusion 

Building on recent progress in design of convolutional 
network architectures, we have shown that it is possible to 
train a network to directly predict optical fiow from two in- 










put images. Intriguingly, the training data need not be re¬ 
alistic. The artificial Flying Chairs dataset including just 
affine motions of synthetic rigid objects is sufficient to pre¬ 
dict optical flow in natural scenes with competitive accu¬ 
racy. This proves the generalization capabilities of the pre¬ 
sented networks. On the test set of the Flying Chairs the 
CNNs even outperform state-of-the-art methods like Deep- 
Flow and EpicFlow. It will be interesting to see how future 
networks perform as more realistic training data becomes 
available. 
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Supplementary Material for ‘FlowNet: Learning Optical Flow with 

Convolutional Networks’ 


1. Flow field color coding 

To visualize the flow flelds, we use the tool provided with 
Sintel [7]. Flow direction is encoded with color and mag¬ 
nitude with color intensity. White corresponds to no mo¬ 
tion. Figure 1 illustrates flow color coding: the flow vector 
at each pixel is a vector from the center of the square to 
this pixel. Since the magnitudes of flows in different im¬ 
age pairs shown in the main paper are very different, we 
independently normalize the maximum color intensity for 
each image pair, but in the same way for different methods 
applied to one image pair. 

2. Details of generating Flying Chairs 



Figure 1. Flow field color coding. The central pixel does not move, 
and the displacement of every other pixel is the vector from the 
center to this pixel. 


We explain in detail the process of generating the Flying 
Chairs dataset. As background we use 964 images of reso¬ 
lution 1024 X 768 pixels, downloaded from Flickr. As fore¬ 
ground objects we use 809 chair models from the dataset of 
Aubry et al. [1], each rendered from 62 views: 31 azimuth 
angles and 2 elevation angles. To generate the first image in 
an image pair, we take a background image and randomly 
position a random set of chairs ontop. The number of the 
chairs is sampled uniformly from [16; 24], the types and 
viewpoints of the chairs are sampled uniformly and the lo¬ 
cations of the chairs are sampled uniformly from the whole 
image. The sizes of the chairs (in pixels) are sampled from 
a Gaussian with mean 200 and standard deviation 200, and 
then clamped between 50 and 640. 

To generate the second image in a pair and the flow field, 
we apply random transformations to the chairs and the back¬ 
ground. Each of these transformations is a composition of 
zooming, rotation and translation. The parameters to sam¬ 
ple are the zoom coefficient, the rotation angle and the trans¬ 
lation vector. We aim to roughly match the displacement 
distribution of Sintel, shown in Fig. 2 (left). Simply sam¬ 
pling the transformation parameters from Gaussians results 
in too few small displacements, we hence make the distri¬ 
butions of the transformation parameters to be more peaked 
around zero than Gaussians. 

The family of distributions from which we sample the 
parameters contains mixtures of two distributions: a con¬ 
stant ji with probability 1 — p and a power of a Gaussian 
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Figure 2. Histogram of displacement distribution in Sintel (left) 
and Flying Chairs (right) with linear (top) and logarithmic (bot¬ 
tom) y-axis. The distribution was cut off at the displacement of 
150 pixels, the maximum flow in Sintel is actually around 450 
pixels. 


with probability p. More precisely, let 7 ^ A/’(/i, a) be a 
univariate Gaussian. We raise its absolute value to a power 
k (keeping the sign) and clamp to the interval [a, b]. We 
then set the value to p with probability 1 — p. Overall, the 
result is given by: 

^ = /? . max(min(sign (7) . 17 ^ 6), a) + (1 - / 3 ) . fi, 

































Parameter 

k 

/^ 

a 

a 

b 

P 

Translation BG 

4 

0 

1.3 

-40 

40 

1 

Rotation BG 

2 

0 

1.3 

-10 

10 

0.3 

Zoom BG 

2 

1 

0.1 

0.93 

1.07 

0.6 

Translation CH 

3 

0 

2.3 

-120 

120 

1 

Rotation CH 

2 

0 

2.3 

-30 

30 

0.7 

Zoom CH 

2 

1 

0.18 

0.8 

1.2 

0.7 


Table 1. Parameters of the distribution of the transformation pa¬ 
rameters. 


where is a Bernoulli random variable equaling 1 with 
probability p and 0 with probability 1 — p. We denote the 
distribution of ^ by G(/c, /i, a, a, b, p). All transformation 
parameters are sampled from distributions from this family, 
with parameters shown in Table 1 . 

Given the transformation parameters, it is straightfor¬ 
ward to generate the second image in the pair, as well as 
the flow fleld and the occlusion map. We then cut each 
image into 4 quarters, resulting in 4 image pairs of size 
512 X 384 pixels each. The displacement histogram of the 
Flying Chairs dataset is shown in Fig. 2 (right). 

We did not study in detail the effect of the dataset pa¬ 
rameters on the FlowNet results. However, we observed, 
that with much simpler strategy of sampling all transforma¬ 
tion parameters from Gaussians the networks still work, but 
are less accurate than networks trained on the data described 
above. 

3. Convolutional Filters 

When taking a closer look at the Alters of the FlowNets, 
one can see that lower layer Alters have few structure and 
higher layer Alters are more structured. Fig. 3 shows how 
the flrst layer Alters have not completely converged, how¬ 
ever coarse gradients are visible. In contrast, the Alters that 
are applied to the output of the correlation layer have very 
visible structure, as shown in Fig. 5. Different Alters are 



Figure 4. Demo Video: Application of FlowNets to a live video 
stream. Watch on YouTube: http://goo.gl/YmMOkR 

selective for different flow directions and magnitudes. 

4. Video 

In the supplementary video we demonstrate the real-time 
operation of the FlowNets using a notebook with a GeForce 
GTX 980M GPU. Resolution of images captured with a 
webcam is 640 x 480 pixels. We show example flow flelds 
produced from real-life videos by both FlowNetSimple and 
FlowNetCorr for indoor and outdoor scenes. The video can 
be found on http://goo.gl/YmMOkR. A sample frame from 
the video can be seen in Fig. 4. 
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Figure 3. First layer Alters of FlowNetCorr. The Alters are noisy, 
but some structure is still visible. 
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Figure 5. Visualization of filters applied on top of the correlation layer in FlowNetCorr. There are 256 filters and for each of them we 
show the weights shaped as a 21 x 21 pixels patch, where each pixel corresponds to a displacement vector. The center pixel of each patch 
corresponds to zero displacement. The filters favor interesting unique displacement patterns. 





















































